Measures of central tendency and dispersion
Humboldt-Universität zu Berlin
2023-12-05
Today we will learn…
how to use the summarise() function from dplyr
how to produce summaries .by group
The required readings for this topic are:
Ch. 3, Sections 3.4-3.9 (Descriptive statistics, models, and distributions) in Winter (2019) (available online for students/employees of the HU Berlin via the HU Grimm Zentrum).
Section 4.5 (Groups) in Ch. 4 (Data Transformation) in Wickham et al. (2023).
Session > Restart R to start with a fresh environment
Cmd/Ctrl+Shift+0
groesse_geburtstag_ws2324.csv: a slightly changed version of the groesse_geburtstag dataset from last week
languageR_english.csv: a condensed version of the english dataset from the languageR package
nrow(): get the number of observations in a dataset
[1] 9
mean, or average: the sum of all values divided by the number of values (as in Equation \(\ref{eq-mean}\))
\[\begin{align} \mu &= \frac{\text{sum of values}}{n} \label{eq-mean} \end{align}\]
171, 168, 182, 190, 170, 163, 164, 167, 189)
[1] 1396
wrap the sum in parentheses (()) before dividing by \(n\)
[1] 173.7778
[1] 173.7778
we can also use the mean() function.
[1] 173.7778
we can use the mean() function on a variable in a data frame by using the $ operator (dataframe$variable).
[1] 173.6667
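The three ways of computing the mean described above can be sketched in R as follows. This is a minimal sketch: the vector name heights is an assumption for illustration, and the values are the nine heights listed above.

```r
# Nine reported heights in cm (the vector name `heights` is illustrative)
heights <- c(171, 168, 182, 190, 170, 163, 164, 167, 189)

# Mean by hand: sum the values, then divide by the number of values.
# The parentheses matter: without them only the last value is divided by 9.
mean_by_hand <- (171 + 168 + 182 + 190 + 170 + 163 + 164 + 167 + 189) / 9

# The same calculation with sum() and length()
mean_sum_length <- sum(heights) / length(heights)

# The built-in shortcut
mean_builtin <- mean(heights)

print(mean_builtin) # 173.7778
```

All three objects hold the same value, so in practice mean() is all you need; the manual versions only make the definition visible.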
we can use the sort() function and count which is the middle value:
[1] 163 164 167 167 170 171 182 189 190
median()
[1] 170
[1] 163 164 167 167 170 171 182 189 251
[1] 170
[1] 180.4444
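The outputs above show that the median does not move when the largest value becomes an outlier, while the mean does. A minimal sketch reproducing this comparison, using the sorted values printed above (the vector names are illustrative assumptions):

```r
# Sorted heights as printed above (vector names are illustrative)
heights_sorted <- c(163, 164, 167, 167, 170, 171, 182, 189, 190)

# The same values, with the largest one replaced by an extreme value
heights_outlier <- c(163, 164, 167, 167, 170, 171, 182, 189, 251)

median(heights_sorted)   # 170
median(heights_outlier)  # 170: the middle value is unchanged

mean(heights_sorted)     # 173.6667
mean(heights_outlier)    # 180.4444: the mean is pulled towards the outlier
```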
mode
[1] 190
[1] 163
range() function
[1] 163 190
[1] 27
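The mode and range can be sketched in R as below. Note that base R's mode() function does not return the statistical mode (it reports how an object is stored), so this sketch counts values with table() instead. The object names are illustrative assumptions; the values are the sorted heights printed above.

```r
heights_sorted <- c(163, 164, 167, 167, 170, 171, 182, 189, 190)

# Range: the smallest and largest value, and the distance between them
max(heights_sorted)          # 190
min(heights_sorted)          # 163
range(heights_sorted)        # 163 190
diff(range(heights_sorted))  # 27

# Mode: count how often each value occurs and pick the most frequent one
counts <- table(heights_sorted)
mode_value <- as.numeric(names(counts)[which.max(counts)])
mode_value                   # 167
```

If several values are tied for most frequent, which.max() returns only the first of them, so this approach is only a rough helper for multimodal data.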
standard deviation (sd or \(\sigma\)) = the square root (\(\sqrt{}\) or sqrt() in R) of the sum of squared deviations from the mean (\((x - \mu)^2\)) divided by the number of observations minus 1 (\(n-1\))
\[\begin{align} \sigma & = \sqrt{\frac{(x_1-\mu)^2 + (x_2-\mu)^2 + \dots + (x_N-\mu)^2}{N-1}} \label{eq-sd} \end{align}\]
sd() function
[1] 10.46157
\[\begin{align} \sigma_{heights} & = \sqrt{\frac{(height_1-\mu)^2 + (height_2-\mu)^2 + \dots + (height_N-\mu)^2}{N-1}} \end{align}\]
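The formula can be checked step by step in R. This is a minimal sketch using the small set of values 3, 5, 9; the object names are illustrative assumptions, and the commented results are rounded.

```r
# A small set of values to work through by hand
x <- c(3, 5, 9)

# Step 1: the mean
mu <- mean(x)                     # about 5.667

# Step 2: squared deviations of each value from the mean
squared_deviations <- (x - mu)^2  # about 7.111, 0.444, 11.111

# Step 3: sum them, divide by n - 1, and take the square root
sd_by_hand <- sqrt(sum(squared_deviations) / (length(x) - 1))
sd_by_hand                        # about 3.055

# The built-in function gives the same result
sd(x)
```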
For a small set of values (3, 5, 9), our values (\(x\)) are:
The dplyr package from the tidyverse has some helpful functions to produce summary statistics
df_eng dataset to learn about these dplyr verbs.
dplyr::summarise
The summarise() function (dplyr) computes summaries of data
The n() function produces the number of observations (only when used inside summarise() or mutate())
# A tibble: 1 × 1
N
<int>
1 4568
mean and standard deviation of lexical decision response times (rt_lexdec, in milliseconds)
# A tibble: 1 × 3
mean_lexdec sd_lexdec N
<dbl> <dbl> <int>
1 708. 115. 4568
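A call that produces a summary of this shape might look like the sketch below. Because df_eng is not reproduced here, the example uses a small hypothetical data frame (df_toy) with invented response times; only the column name rt_lexdec and the summary column names are taken from the output above.

```r
library(dplyr)

# Hypothetical toy data: NOT the df_eng values, only the same column name
df_toy <- tibble(
  rt_lexdec = c(620, 655, 710, 790, 805, 640)
)

# Several summaries can be computed in a single summarise() call
summary_toy <- df_toy |>
  summarise(
    mean_lexdec = mean(rt_lexdec),
    sd_lexdec = sd(rt_lexdec),
    N = n()
  )

summary_toy
```

The result is a data frame with one row and one column per named summary.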
Missing values
rt_naming has a missing value
the mean() function does not work with missing values (it returns NA)
we can remove missing values with drop_na()
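The behaviour with missing values can be sketched as follows. The data frame df_toy and its values are hypothetical; only the column name rt_naming comes from the text above. Besides drop_na() (from tidyr), the sketch shows the na.rm = TRUE argument of mean() as an alternative.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with one missing naming time
df_toy <- tibble(rt_naming = c(450, 480, NA, 510))

# mean() returns NA as soon as one value is missing
mean(df_toy$rt_naming)  # NA

# Option 1: drop rows with a missing rt_naming before summarising
mean_dropped <- df_toy |>
  drop_na(rt_naming) |>
  summarise(mean_naming = mean(rt_naming), N = n())

# Option 2: tell mean() to ignore missing values
mean_narm <- mean(df_toy$rt_naming, na.rm = TRUE)
```

With drop_na() the row count N reflects only the complete rows, which is usually what you want to report alongside the mean.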
groesse between L1 speaker groups
.by =
The .by = argument in summarise() computes our calculations on groups within a categorical variable
# A tibble: 2 × 4
age_subject mean_lexdec sd_lexdec N
<chr> <dbl> <dbl> <int>
1 young 630. 69.1 2283
2 old 787. 96.2 2284
we can group by multiple variables using concatenate (c())
# A tibble: 4 × 5
age_subject word_category mean_lexdec sd_lexdec N
<chr> <chr> <dbl> <dbl> <int>
1 old N 790. 101. 1452
2 old V 780. 86.5 832
3 young N 633. 70.8 1451
4 young V 623. 65.7 832
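Grouped summaries like the two outputs above can be sketched as below. The data frame df_toy and its values are hypothetical; only the column names mirror the outputs. The .by argument requires dplyr 1.1.0 or later.

```r
library(dplyr)

# Hypothetical toy data (not the df_eng values)
df_toy <- tibble(
  age_subject = c("young", "young", "young", "old", "old", "old"),
  word_category = c("N", "V", "N", "N", "V", "V"),
  rt_lexdec = c(620, 610, 640, 790, 770, 800)
)

# One grouping variable: one row per age group
by_age <- df_toy |>
  summarise(mean_lexdec = mean(rt_lexdec), N = n(), .by = age_subject)

# Two grouping variables, combined with c():
# one row per combination of age group and word category
by_age_category <- df_toy |>
  summarise(
    mean_lexdec = mean(rt_lexdec),
    N = n(),
    .by = c(age_subject, word_category)
  )
```

Unlike group_by(), .by applies only to the call it appears in, so the result is not left grouped.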
| dataset | mean_x | mean_y |
|---|---|---|
| Dataset 1 | 9 | 7.5 |
| Dataset 2 | 9 | 7.5 |
| Dataset 3 | 9 | 7.5 |
| Dataset 4 | 9 | 7.5 |
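The means in this table match those of Anscombe's quartet, which ships with base R as the anscombe data frame. Assuming that is the source of the table, the values can be checked with the sketch below (the object names are illustrative).

```r
# anscombe is part of base R (datasets package): columns x1..x4 and y1..y4
x_cols <- c("x1", "x2", "x3", "x4")
y_cols <- c("y1", "y2", "y3", "y4")

x_means <- colMeans(anscombe[, x_cols])
y_means <- colMeans(anscombe[, y_cols])

round(x_means, 1)  # all 9
round(y_means, 1)  # all 7.5

# The standard deviations are nearly identical across the four datasets too
x_sds <- sapply(anscombe[, x_cols], sd)
y_sds <- sapply(anscombe[, y_cols], sd)
```

Identical summary statistics do not imply identical data, which is why plotting each dataset remains necessary.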
mean and sd, but different distributions
| dataset | mean_x | mean_y | std_dev_x | std_dev_y | corr_x_y |
|---|---|---|---|---|---|
| away | 54.27 | 47.83 | 16.77 | 26.94 | -0.06 |
| bullseye | 54.27 | 47.83 | 16.77 | 26.94 | -0.07 |
| circle | 54.27 | 47.84 | 16.76 | 26.93 | -0.07 |
| dino | 54.26 | 47.83 | 16.77 | 26.94 | -0.06 |
| dots | 54.26 | 47.84 | 16.77 | 26.93 | -0.06 |
| h_lines | 54.26 | 47.83 | 16.77 | 26.94 | -0.06 |
| high_lines | 54.27 | 47.84 | 16.77 | 26.94 | -0.07 |
| slant_down | 54.27 | 47.84 | 16.77 | 26.94 | -0.07 |
| slant_up | 54.27 | 47.83 | 16.77 | 26.94 | -0.07 |
| star | 54.27 | 47.84 | 16.77 | 26.93 | -0.06 |
| v_lines | 54.27 | 47.84 | 16.77 | 26.94 | -0.07 |
| wide_lines | 54.27 | 47.83 | 16.77 | 26.94 | -0.07 |
| x_shape | 54.26 | 47.84 | 16.77 | 26.93 | -0.07 |
Figure 2: Plots of datasauRus dataset distributions
Today we learned…
how to use the summarise() function from dplyr ✅
how to produce summaries .by group ✅
Calculate the standard deviation of the values 152, 19, 1398, 67, 2111 without using the function sd()
Use sd() to print the standard deviation of the values above. Did you get it right?
Using summarise, print the mean, standard deviation, and number of observations for dep_delay.
Hint: how do we deal with missing values (NAs)?
Use the .by argument to find the departure delay (dep_delay) per month
arrange() the output by the mean departure delay
Created with R version 4.3.0 (2023-04-21) (Already Tomorrow) and RStudio version 2023.9.0.463 (Desert Sunflower).
R version 4.3.0 (2023-04-21)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.2.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/Berlin
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] datasauRus_0.1.6 patchwork_1.1.3 janitor_2.2.0 here_1.0.1
[5] lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0 dplyr_1.1.3
[9] purrr_1.0.2 readr_2.1.4 tidyr_1.3.0 tibble_3.2.1
[13] ggplot2_3.4.3 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] gtable_0.3.4 xfun_0.39 lattice_0.21-8 tzdb_0.4.0
[5] vctrs_0.6.3 tools_4.3.0 generics_0.1.3 parallel_4.3.0
[9] fansi_1.0.4 highr_0.10 pacman_0.5.1 pkgconfig_2.0.3
[13] Matrix_1.5-4 webshot_0.5.4 lifecycle_1.0.3 compiler_4.3.0
[17] farver_2.1.1 munsell_0.5.0 snakecase_0.11.0 htmltools_0.5.5
[21] yaml_2.3.7 pillar_1.9.0 crayon_1.5.2 nlme_3.1-162
[25] tidyselect_1.2.0 rvest_1.0.3 digest_0.6.33 stringi_1.7.12
[29] labeling_0.4.3 splines_4.3.0 rprojroot_2.0.3 fastmap_1.1.1
[33] grid_4.3.0 colorspace_2.1-0 cli_3.6.1 magrittr_2.0.3
[37] utf8_1.2.3 withr_2.5.0 scales_1.2.1 bit64_4.0.5
[41] timechange_0.2.0 rmarkdown_2.22 httr_1.4.6 bit_4.0.5
[45] hms_1.1.3 kableExtra_1.3.4 evaluate_0.21 knitr_1.44
[49] viridisLite_0.4.2 mgcv_1.8-42 rlang_1.1.1 glue_1.6.2
[53] xml2_1.3.4 svglite_2.1.1 rstudioapi_0.14 vroom_1.6.3
[57] jsonlite_1.8.7 R6_2.5.1 systemfonts_1.0.4
Week 9 - Descriptive statistics